The list of differentially expressed genes is sometimes so long that its interpretation becomes cumbersome and time consuming. It may also be very short while some genes have low p-value yet higher than the given threshold.
A common downstream procedure to combine information across genes is gene set testing. It aims at finding pathways or gene networks the differentially expressed genes play a role in.
Various ways exist to test for enrichment of biological pathways. We will look into over representation and gene set enrichment analyses.
A gene set comprises genes that share a biological function, chromosomal location, or any other relevant criterion.
To save time and effort there are a number of packages that make applying these tests to a large number of gene sets simpler, and which will import gene lists for testing from various sources.
Today we will use clusterProfiler.
This method tests whether genes in a pathway are present in a subset of our data in a higher number than expected by chance (explanations derived from the clusterProfiler manual).
Genes in the experiment are split in two ways:
We can then create a contingency table with:
And test for independence of the two variables with the Fisher exact test.
clusterProfilerclusterprofiler (Yu et al. 2012) supports direct online access of the current KEGG database (KEGG: Kyoto Encyclopedia of Genes and Genomes), rather than relying on R annotation packages. It also provides some nice visualisation options.
We first search the resource for mouse data:
library(tidyverse)
library(clusterProfiler)
search_kegg_organism('mouse', by='common_name')
## kegg_code scientific_name
## 26 mmur Microcebus murinus
## 30 mmu Mus musculus
## 31 mcal Mus caroli
## 32 mpah Mus pahari
## 34 mcoc Mastomys coucha
## 40 pleu Peromyscus leucopus
## 50 plop Perognathus longimembris pacificus
## 113 mmyo Myotis myotis
## 188 csti Colius striatus
## 5722 asf Candidatus Arthromitus sp. SFB-mouse-Japan
## 5723 asm Candidatus Arthromitus sp. SFB-mouse-Yit
## 5724 aso Candidatus Arthromitus sp. SFB-mouse-NL
## common_name
## 26 gray mouse lemur
## 30 house mouse
## 31 Ryukyu mouse
## 32 shrew mouse
## 34 southern multimammate mouse
## 40 white-footed mouse
## 50 Pacific pocket mouse
## 113 greater mouse-eared bat
## 188 speckled mousebird
## 5722 Candidatus Arthromitus sp. SFB-mouse-Japan
## 5723 Candidatus Arthromitus sp. SFB-mouse-Yit
## 5724 Candidatus Arthromitus sp. SFB-mouse-NL
We will use the ‘mmu’ ‘kegg_code’.
The input for the KEGG enrichment analysis is the list of gene IDs of significant genes.
We now load the R object keeping the outcome of the differential expression analysis for the d11 contrast.
res_d11_shrink <- readRDS("preprocessed/r_objects/res_infected_vs_uninfected_d11_annotated_shrink.rds")
We will only use genes that have:
We need to remember to eliminate genes with missing values in the FDR as a result of the independent filtering by DESeq2.
For this tool we need to use Entrez IDs, so we will also need to eliminate genes with a missing Entrez ID (NA values in the ‘Entrez’ column).
sig_genes_entrez <- res_d11_shrink %>%
drop_na(Entrez, padj) %>%
filter(padj < 0.05 & abs(log2FoldChange) > 1) %>%
pull(Entrez)
kegg_results <- enrichKEGG(gene = sig_genes_entrez, organism = 'mmu')
## Reading KEGG annotation online: "https://rest.kegg.jp/link/mmu/pathway"...
## Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/mmu"...
as_tibble(kegg_results)
## # A tibble: 74 × 14
## category subcategory ID Description GeneRatio BgRatio RichFactor
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Human Diseases Infectious… mmu0… Herpes sim… 63/351 213/10… 0.296
## 2 Organismal Systems Immune sys… mmu0… Antigen pr… 39/351 87/105… 0.448
## 3 Human Diseases Infectious… mmu0… Epstein-Ba… 55/351 228/10… 0.241
## 4 Human Diseases Immune dis… mmu0… Graft-vers… 31/351 60/105… 0.517
## 5 Human Diseases Endocrine … mmu0… Type I dia… 32/351 67/105… 0.478
## 6 Cellular Processes Transport … mmu0… Phagosome 47/351 183/10… 0.257
## 7 Human Diseases Immune dis… mmu0… Allograft … 30/351 60/105… 0.5
## 8 Human Diseases Infectious… mmu0… Influenza A 45/351 174/10… 0.259
## 9 Environmental Inf… Signaling … mmu0… Cell adhes… 44/351 181/10… 0.243
## 10 Human Diseases Infectious… mmu0… Leishmania… 28/351 70/105… 0.4
## # ℹ 64 more rows
## # ℹ 7 more variables: FoldEnrichment <dbl>, zScore <dbl>, pvalue <dbl>,
## # p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
clusterProfiler has a function browseKegg to view the KEGG pathway in a browser, highlighting the genes we selected as differentially expressed.
We will show one of the top hits: pathway ‘mmu04612’ for ‘Antigen processing and presentation’.
browseKEGG(kegg_results, 'mmu04612')
The package pathview (Luo et al. 2013) can be used to generate figures of KEGG pathways.
One advantage over the clusterProfiler browser function browseKEGG() is that genes can be coloured according to fold change levels in our data. To do this we need to pass pathview a named vector of fold change values (one could in fact colour by any numeric vector, e.g. p-value).
The package plots the KEGG pathway to a png file in the working directory.
library(pathview)
lfc <- res_d11_shrink$log2FoldChange
names(lfc) <- res_d11_shrink$Entrez
pathview(gene.data = lfc,
pathway.id = "mmu04612",
species = "mmu",
limit = list(gene = 20, cpd = 1))
mmu04612.pathview.png:
mmu04612 - Antigen processing and presentation
pathview to export a figure for “mmu04659” or “mmu04658”, but this time only use genes that are statistically significant at padj < 0.01Answer
First, we need to create a new list of genes using the new filter:
lfc <- res_d11_shrink %>%
drop_na(padj, Entrez) %>%
filter(padj < 0.01) %>%
pull(log2FoldChange, Entrez)
Then, we can use the pathview() function as demonstrated above:
pathview(gene.data = lfc,
pathway.id = "mmu04659",
species = "mmu",
limit = list(gene = 5, cpd = 1))
## 'select()' returned 1:1 mapping between keys and columns
## Info: Working in directory /mnt/c/Users/hugot/Documents/crit/repos/rnaseq-course-nxf/course_files
## Info: Writing image file mmu04659.pathview.png
This outputs an image to the local working directory.
mmu04659 - Th17 cell differentiation
clusterProfiler can also perform over-representation analysis on GO terms using the function enrichGO(). For this analysis we will use Ensembl gene IDs instead of Entrez IDs and in order to do this we need to load another package which contains the mouse database called org.Mm.eg.db.
To run the GO enrichment analysis, this time we also need a couple of extra things. Firstly, we should provide a list of the ‘universe’ of all the genes in our DE analysis not just the ones we have selected as significant.
Gene Ontology terms are divided into 3 categories.
For this analysis we will narrow our search terms in the ‘Biological Processes’ Ontology so we can add the parameter “BP” with the ‘ont’ argument (the default is Molecular Functions).
library(org.Mm.eg.db)
# get ENSEMBL IDs for significant genes
sig_genes_ensembl <- res_d11_shrink %>%
drop_na(padj) %>%
filter(padj < 0.01 & abs(log2FoldChange) > 2) %>%
pull(GeneID)
# get names for all the genes we have data for
universe <- res_d11_shrink$GeneID
# run enrichment analysis
go_results <- enrichGO(gene = sig_genes_ensembl,
universe = universe,
OrgDb = org.Mm.eg.db,
keyType = "ENSEMBL",
ont = "BP",
pvalueCutoff = 0.01,
readable = TRUE)
We can use the barplot() function to visualise the results. Count is the number of differentially expressed in each gene ontology term.
barplot(go_results, showCategory = 20)
An alternative is the dotplot() version, which can be more informative. Gene ratio is Count divided by the number of genes in that GO term.
dotplot(go_results, font.size = 14)
Another visualisation that can be nice to try is the emapplot() which shows the overlap between genes in the different GO terms.
library(enrichplot)
## enrichplot v1.28.2 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
##
## Please cite:
##
## Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
## clusterProfiler: an R package for comparing biological themes among
## gene clusters. OMICS: A Journal of Integrative Biology. 2012,
## 16(5):284-287
ego_pt <- pairwise_termsim(go_results)
emapplot(ego_pt)
Gene Set Enrichment Analysis (GSEA) identifies gene sets that are enriched in the dataset between samples (Subramanian et al. 2005).
The software is distributed by the Broad Institute and is freely available for use by academic and non-profit organisations. The Broad also provide a number of very well curated gene sets for testing against your data - the Molecular Signatures Database (MSigDB). These are collections of human genes. Fortunately, these lists have been translated to mouse equivalents by the Walter+Eliza Hall Institute Bioinformatics service and made available for download. They are now also available from a recent R package msigdbr, which we will use.
Let’s load msigdbr now.
library(msigdbr)
The analysis is performed by:
The article describing the original software is available here, while this commentary on GSEA provides a shorter description.
We will use clusterProfiler’s GSEA package (Yu et al. 2012) that implements the same algorithm in R.
We need to provide GSEA with a vector containing values for a given gene mtric, e.g. log(fold change), sorted in decreasing order.
To start with we will simply use a rank the genes based on their fold change.
We must exclude genes with no Ensembl ID.
Also, we should use the shrunk LFC values.
ranked_genes <- res_d11_shrink %>%
drop_na(GeneID, padj, log2FoldChange) %>%
mutate(rank = log2FoldChange) %>%
arrange(desc(rank)) %>%
pull(rank, GeneID)
We will load the MSigDB Hallmark gene set with msigdbr, setting the category parameter to ‘H’ for Hallmark gene set. The object created is a tibble with information on each {gene set; gene} pair (one per row). We will only keep the the gene set name, gene Ensembl ID.
term2gene <- msigdbr(species = "Mus musculus", category = "H") %>%
dplyr::select(gs_name, ensembl_gene)
## Using human MSigDB with ortholog mapping to mouse. Use `db_species = "MM"` for mouse-native gene sets.
## This message is displayed once per session.
## Warning: The `category` argument of `msigdbr()` is deprecated as of msigdbr 10.0.0.
## ℹ Please use the `collection` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
term2name <- msigdbr(species = "Mus musculus", category = "H") %>%
dplyr::select(gs_name, gs_description) %>%
distinct()
Arguments passed to GSEA include:
gsea_results <- GSEA(ranked_genes,
TERM2GENE = term2gene,
TERM2NAME = term2name,
pvalueCutoff = 1.00,
minGSSize = 15,
maxGSSize = 500)
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## leading edge analysis...
## done...
Let’s look at the top 10 results, based on the normalised enrichment score (NES):
gsea_results %>%
# convert to regular data.frame/tibble
as_tibble() %>%
# get the top genes based on adjusted p-value
top_n(10, wt = -p.adjust) %>%
# sort by normalised enrichment score
arrange(desc(abs(NES)))
The enrichment score plot displays along the x-axis that represents the decreasing gene rank:
gseaplot(gsea_results,
geneSetID = "HALLMARK_INFLAMMATORY_RESPONSE",
title = "HALLMARK_INFLAMMATORY_RESPONSE")
Remember to check the GSEA article for the complete explanation.
Another common way to rank the genes is to order by p-value while sorting, so that upregulated genes are at the start and downregulated at the end. You can do this combining the sign of the fold change and the p-value.
Rank the genes by statistical significance - you will need to create a new ranking value using -log10(pvalue) * sign(log2FoldChange).
Run GSEA using the new ranked genes and the H pathways.
Conduct the same analysis for the day 33 Infected vs Uninfected contrast. Make sure to first load the results using the following:
# read d33 data in
res_d33_shrink <- readRDS("preprocessed/r_objects/res_infected_vs_uninfected_d33_annotated_shrink.rds")Answer
First load the pathway details if you have not already done so.
library(msigdbr)
term2gene <- msigdbr(species = "Mus musculus", category = "H") %>%
dplyr::select(gs_name, ensembl_gene)
term2name <- msigdbr(species = "Mus musculus", category = "H") %>%
dplyr::select(gs_name, gs_description) %>%
distinct()
1.
We can generate a new table of ranked genes using the new ranking based on p-values:
# rank genes
ranked_genes_pval <- res_d11_shrink %>%
drop_na(GeneID, pvalue, log2FoldChange) %>%
mutate(rank = -log10(pvalue) * sign(log2FoldChange)) %>%
arrange(desc(rank)) %>%
pull(rank, GeneID)
The result is a named vector with sorted p-values. Remember that a high -log10(pvalue) corresponds to a very low (i.e. significant) p-value. By multiplying -log10(pvalue) by the sign of the log2FoldChange we ensure that our p-values are ordered from upregulated to downregulated genes. Here is a peek at the first (head()) and last (tail()) few genes of our ranked gene vector:
head(ranked_genes_pval)
## ENSMUSG00000046879 ENSMUSG00000078853 ENSMUSG00000069874 ENSMUSG00000060586
## 119.08761 111.37314 110.87312 110.33734
## ENSMUSG00000024610 ENSMUSG00000054072
## 96.34973 95.07151
tail(ranked_genes_pval)
## ENSMUSG00000066443 ENSMUSG00000040488 ENSMUSG00000004891 ENSMUSG00000021342
## -7.433218 -7.710282 -7.728996 -8.147132
## ENSMUSG00000066842 ENSMUSG00000020154
## -8.628705 -11.697586
2.
Once we have our ranked genes, we can use the GSEA() function as demonstrated earlier:
# conduct analysis:
gsea_res_pval <- GSEA(ranked_genes_pval,
TERM2GENE = term2gene,
TERM2NAME = term2name,
pvalueCutoff = 1.00,
minGSSize = 15,
maxGSSize = 500)
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
## leading edge analysis...
## done...
To view the results we can convert this object to a regular data.frame/tibble, extract the top results based on p-value and sort them by the normalised enrichment score:
gsea_res_pval %>%
as_tibble() %>%
top_n(10, wt=-p.adjust) %>%
arrange(desc(abs(NES)))
3.
We repeat the analysis we just did for d11, but using the d33 contrast instead:
# rank genes
ranked_genes_d33 <- res_d33_shrink %>%
drop_na(GeneID, pvalue, log2FoldChange) %>%
mutate(rank = -log10(pvalue) * sign(log2FoldChange)) %>%
arrange(desc(rank)) %>%
pull(rank, GeneID)
# perform analysis
gsea_res_d33 <- GSEA(ranked_genes_d33,
TERM2GENE = term2gene,
TERM2NAME = term2name,
pvalueCutoff = 1.00,
minGSSize = 15,
maxGSSize = 500)
## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
## preparing geneSet collections...
## GSEA analysis...
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 3 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.
## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some pathways, in reality P-values are less than 1e-10. You can
## set the `eps` argument to zero for better estimation.
## leading edge analysis...
## done...
View the results:
as_tibble(gsea_res_d33) %>%
top_n(10, wt = -p.adjust) %>%
arrange(desc(abs(NES)))
Luo, Weijun, Brouwer, and Cory. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. https://doi.org/10.1093/bioinformatics/btt285.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.